Hands-on 2: Data I/O and Distribution Analysis in Python¶

Recap Section¶

Before diving into practical analysis, let's review key concepts about probability distributions:

  • Probability Density Function (PDF): Describes the relative likelihood of a random variable taking a specific value
  • Cumulative Distribution Function (CDF): Probability that a random variable takes a value less than or equal to a given value
  • Parameters: Values that define the shape and location of a distribution
    • Location parameters (e.g., mean, median)
    • Scale parameters (e.g., standard deviation)
    • Shape parameters (e.g., skewness)

Common distributions we'll encounter:

  • Normal (Gaussian): Symmetric, bell-shaped distribution
  • Log-normal: Right-skewed, only positive values
  • Gamma: Right-skewed, flexible shape
  • Beta: Bounded between 0 and 1, flexible shape

Problem Statement¶

In this hands on we will have different practical occasion to learn:

  1. Load and inspect data from different file formats
  2. Visualize distributions using various plotting techniques
  3. Calculate summary statistics and interpret their meaning
  4. Identify appropriate distributions for different types of measurements

We will also play with the families of distribution exposed by scipy.stats and visualize: 5. The characteristic domain and shape of the different distributions. 6. How distributions change with changes of parameters.

We will analyze a dataset containing biological aging markers:

Telomere Length Measurements (telomere_length.tsv)

  • Measured in base pairs (typical range: 5,000-15,000 bp)
  • Known to decrease with age (50-100 bp/year)
  • Includes metadata: patient ID, age, gender, measurement date, batch
  • Stored in TSV format
  • Expected to follow approximately normal distribution with age-dependent mean

DNA Methylation Levels (dna_methylation.tsv)

  • Measurements from 4 age-related CpG sites
  • Values bounded between 0 and 1 (proportion methylated)
  • Includes experimental metadata: plate ID, well position
  • Stored in TSV format
  • Follows beta distribution with age-dependent parameters
  • Each CpG site has distinct methylation patterns

Histone Modification Counts (protein_modifications.csv)

  • Count data for 4 key histone modifications (H3K4me3, H3K27me3, H3K9ac, H4K20me1)
  • Discrete count data measuring post-translational modifications
  • Includes technical replicates and batch information
  • Stored in CSV format
  • Follows negative binomial distribution (overdispersed count data)
  • Different proteins show varying baseline modification rates
In [ ]:
# General imports - here we import the libraries we will use in this notebook
import numpy as np  # NumPy, a fundamental package for scientific computing with Python.

import pandas as pd # Pandas, a library providing high-performance, easy-to-use data structures and data analysis tools for Python
from scipy import stats # SciPy, a Python-based ecosystem of open-source software for mathematics, science, and engineering.

import matplotlib.pyplot as plt # Matplotlib, a 2D plotting library which produces publication quality figures in a variety of formats
import seaborn as sns # Seaborn, a Python visualization library based on Matplotlib, provides a high-level interface for drawing attractive statistical graphics.

Deliverables¶

Task 1. Telomere Length Analysis

  • Create a visualization showing how telomere length distributions for different covariates, use histograms (check pandas function hist). You should hand in 3 plots side by side (use plt.subplots) containing the following:
    1. Gender: Two histograms overlayed on top of the other (check the parameter "alpha" in the .hist() function) showing the telomere length distribution. One for males and the other for females. Hint: The pandas function groupby with a call to .hist() after can do most of the work.
    2. Batch:Three histograms overlayed, one for each batch. You can do it in a similar way than with gender.
    3. Age: For the last covariate, age, we first need to define a fixed number of age groups to plot multiple histograms as we did before. Choose the distinction you see fit (e.g Young, Middle, Old or 31-40, 41-50) and create a new variable with the age groups. Combine the function pd.cut with pandas apply. Again, provide a set of n overlayed histrograms (one per age group). Alternatively, you can use the function pd.qcut to define the groups.
  • Use Box and Wiskers plot to show the different distributions of telomere. Annotate the plot with text detailing mean and variance (use ax.text) of each group. You should hand in the following 3 plots:
    1. Gender: The first figure should contain two boxplots, one for males and the other for females. You can plot them vertically one next to the other, so that the x-axis is the gender and the y-axis the telomere length.
    2. Batch: Same as with gender, now you should have 3 boxplots.
    3. Age: Use the same age groups you created before so that you have n boxplots side by side.
  • Study telomere length distribution, using quantile analysis (q-q plots) to assess normality assumptions. This corresponds to one qq-plot of the telomere length variable. Hint: Use the function stats.probplot to obtain the theoretical and empirical quantiles (x and y of the plot respectively) and then plot them with plt.scatter.

Task 2. DNA Methylation Analysis

In this part, we will generate multi-panel comparisons of methylation patterns across the four CpG sites ('cg00000292', 'cg01333755', 'cg02872426', 'cg03388593'). You will have to deliver the following plots:

  • For each of the the four CpG sites, make a longitudinal plot (plt.scatter) highlighting how these patterns shift with age. The x-axis should correspond to age and the y-axis to methylation level. You should deliver 4 plots, one per methylation level. Try to arrange them in a grid using "plt.subplots(2, 2)".
  • Repeat the same 4 plots as before, but now the y-axis should be the logit transform of the methylation level. Use the function apply combined with logit to create a new column which should be the new y-axis. Hint: When applying the function logit, you might encounter some NA values. To avoid this, use the function clip as follows: .clip(1e-6, 1 - 1e-6)
  • Use a heatmap to make a visual QC plot show conditions vs across experimental plates to reveals potential technical biases in the methylation measurements. This should correspond to 6x4=24 heatmaps arranged in a 6x4 grid (use plt.subplots(6, 4)). Each individual heatmap corresponds to a specific combination of one of the 4 CpG sites with one of the 6 plates (P1-P6). In this heatmap, we will represent the value of the "MethylationLevel" column for each position in the plate (variable "Position"). If you check the Position variable you will notice that it goes from A1-H12 which give the coordinates in the plate. The letter represents the row and the number the column, for example B3 would correspond to the square in row 2 and column 3. This is what the heatmap should represent: a grid of 8x12 where the color is the "MethylationLevel" at each position.

Task 3. Protein Modification Analysis

  • Select a protein (i.e. H3K4me3) and show a plot to inspect the data and guess the type and parameters of distribution it comes from (at least one condition). You should deliver one histogram corresponding to the distribution of the variable "ModificationCount" for the selected protein. Hint: For guessing the distribution, think of the type of variable you are dealing with (Discrete Counts) and what known distribution(s) you have seen during the lectures that might be adequate for this type of data.
  • Verify your guess via two plots:
    1. Q-Q plot: Use a QQ-plot to compare the the theoretical and empirical quantiles. Check the argument "dist" of the function probplot.
    2. Histogram and probability density function: Plot again the histogram of the variable and now try to overlay on top the theoretical distribution. You should choose the parameter(s) of the distribution you think are adequate (you can try some values by hand or try to estimate them, in lecture 2 you will see how to properly estimate them). To plot the theoretical distribution, use the "pmf" function from the distribution of your choice from scipy.stats Hint: If you do not find a good fit it is ok, submit the figures with your guess.

IMPORTANT: Format Requirements¶

  • All plots must include:
    • Clear title (plt.title or ax.set_title())
    • Properly labeled axes with units (check functions plt.xlabel() or ax.set_xlabel())
    • Legend where appropriate (plt.legend())
    • Font size ≥ 10pt (usually parameters of the label functions)
    • Grid lines where appropriate (plt.grid() or ax.grid())

Summary of deliverables:¶

Task 1:

  • 3 sets of histograms.
  • 3 sets of boxplots.
  • 1 QQ-plot.

Task 2:

  • 4 scatterplots.
  • 4 scatterplots with logit transform.
  • Grid of 6x24 heatmaps.

Task 3:

  • 1 histogram.
  • 1 histogram with pdf overlayed.
  • 1 QQ-plot.

Provide a pdf file containing all required plots, you can do it as you want. Some examples:

  • Export the notebook as pdf (in vs code comand/ctrl+shift+P and then search for export).
  • Copy paste all figures to a word file or docs and export as pdf.

Task 1¶

In [ ]:
# Task 1 - Part 1

# TODO;Read the data
telomere_df = ... # Hint: Use pd.read_csv() function

# TODO: Create age groups for visualization

# Create figure
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# TODO; Gender distribution

# TODO: Batch distribution

# TODO: Age groups distribution


# Show subplots
plt.tight_layout()
plt.show()
In [ ]:
# Task 1 - Part 2

# Box plots with statistics
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# TODO: Gender boxplot with statistics


# TODO: Batch boxplot with statistics

# TODO: Age group boxplot with statistics


# Show subplots
plt.tight_layout()
plt.show()
In [ ]:
# Task 1 - Part 3

# Distribution analysis with pdf and Q-Q plot
fig, ax1 = plt.subplots(1, 1, figsize=(5, 5))

# TODO: Extract quantiles using stats.probplot() function

# TODO: Plot them using plt.scatter

# Show plot
plt.tight_layout()
plt.show()

Task 2¶

In [ ]:
# TODO: Read the methylation data
methylation_df = ... # Hint: Use pd.read_csv() function
In [ ]:
# Task 2 - Part 1

# First plot - CpG site comparison
# Create 4 subplots in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(7, 7))


# TODO: Plot each methylation level vs age of each CpG site in its subplot


# Show subplots
plt.tight_layout()
plt.show()
In [ ]:
# Task 2 - Part 2

from scipy.special import logit

# TODO: Create new column with logit of methylation level
methylation_df['Methylation_logit'] = # Hint: Use apply() to the methylation column

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(7, 7))

# TODO: Plot each methylation level vs age of each CpG site in its subplot

plt.tight_layout()
plt.show()
In [ ]:
# Task 2 - Part 3

# Create a heatmap per plate (6) and per methylation site (4)
# Each heatmap represents a 96 weel plate (8 x 12 grid - A-H/1-12)

# Create a 6x4 grid of subplots
fig, axes = plt.subplots(6, 4, figsize=(12, 12))

# TODO: Create a heatmap for each plate and each CpG site

# Show subplots
plt.tight_layout()
plt.show()

Task 3¶

In [ ]:
# Task 3 - Part 1

# TODO: Read protein modification data
protein_df = ... # Hint: Use pd.read_csv() function

# TODO: Let's focus on H3K4me3 as an example protein
protein = # "H3K4me3"

# Create figure for distribution inspection
fig, ax = plt.subplots(figsize=(10, 6))

# TODO: Create histogram of counts


plt.show()
In [ ]:
# Task 3 - Part 2

# Create figure for distribution inspection
fig, ax1 = plt.subplots(figsize=(7, 7))

# TODO: Create (again) histogram of counts

# TODO: Get points of the theoretical distribution
x = ... # Hint: Check np.arange
y = ... # Hint: Use pmf() function from the distribution of your choice

# TODO: Plot the theoretical distribution (use ax1.plot)


plt.show()
In [ ]:
# Task 3 - Part 3

f, ax = plt.subplots(figsize=(7, 7))
# TODO: Create Q-Q plot to assess fit

plt.show()